Objective throughout this whole class: Predict a response variable \(Y\) using predictors \(X_1, X_2, \dots, X_p\).
So far we have mostly looked at global models.
We assumed there was some function of the predictors that modeled the mean of the response:
\[ \mu_{y \mid x} = f(\underline{X}) \]
We assume this model holds over the whole space of our predictors.
We have gone over a couple of non-parametric methods: kernel smoothing and KNN.
Next we will look at Classification and Regression Trees (CART).
What patterns do you see in median housing prices?
calif <- read.table("http://www.stat.cmu.edu/~cshalizi/350/hw/06/cadata.dat",
header = TRUE)
head(calif)
# Bin prices into deciles and shade the map by decile
price.deciles <- quantile(calif$MedianHouseValue, 0:10/10)
cut.prices <- cut(calif$MedianHouseValue, price.deciles, include.lowest = TRUE)
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20,
xlab = "Longitude", ylab = "Latitude")
We create a binary tree:
Create a node: choose a predictor \(X_j\) and split its range into two regions, \(X_j < s\) and \(X_j \geq s\).
Within each of those regions, choose another predictor (or the same one) and split its range into two regions.
Continue choosing predictors and splitting their regions until you reach an end point where you stop creating regions. Such a terminal region is known as a leaf.
Average the \(Y\) observations that fall in the region defining a leaf (or take the majority class in classification). That average (or majority class) is the predicted value of the response.
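To make the leaf idea concrete, here is a minimal sketch in base R of a single hypothetical split of the California data on Longitude at -120 (an arbitrary cutpoint chosen for illustration, not one picked by any algorithm):
# One hypothetical split: Longitude cut at -120 (illustration only)
left  <- calif$MedianHouseValue[calif$Longitude <  -120]
right <- calif$MedianHouseValue[calif$Longitude >= -120]
c(left = mean(left), right = mean(right))  # the two leaf predictions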
### Libraries !!!
library(tree)
treefit <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data = calif)
plot(treefit)
text(treefit, cex = 0.75)
price.deciles <- quantile(calif$MedianHouseValue, 0:10/10)
cut.prices <- cut(calif$MedianHouseValue, price.deciles, include.lowest = TRUE)
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20,
xlab = "Longitude", ylab = "Latitude")
partition.tree(treefit, ordvars = c("Longitude", "Latitude"), add = TRUE)
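Once the tree is grown, a prediction is just the value stored in the leaf a new observation falls into. For example, at a hypothetical location (coordinates chosen arbitrarily):
# Predicted log median house value at a hypothetical location
predict(treefit, newdata = data.frame(Longitude = -122.3, Latitude = 37.9))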
The flexibility of a tree is governed by its number of nodes: more nodes means we divide the predictor space into smaller and smaller pieces.
# mindev: a node is split only while its deviance is at least mindev times
# that of the root, so smaller values allow deeper trees (see ?tree.control)
treefit2 <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data = calif,
mindev = 0.005)
plot(treefit2)
text(treefit2, cex = 0.6)
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20,
xlab = "Longitude", ylab = "Latitude")
partition.tree(treefit2, ordvars = c("Longitude", "Latitude"), add = TRUE)
# treefit2
treefit3 <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data = calif,
mindev = 0.001)
plot(treefit3)
text(treefit3, cex = 0.4)
plot(calif$Longitude, calif$Latitude, col = grey(10:2/11)[cut.prices], pch = 20,
xlab = "Longitude", ylab = "Latitude")
partition.tree(treefit3, ordvars = c("Longitude", "Latitude"), add = TRUE)
# treefit3
We are not restricted to predictors that divide the space into literal geographic regions; any predictor can be split on.
treefit4 <- tree(log(MedianHouseValue) ~ ., data = calif)
plot(treefit4)
text(treefit4, cex = 0.75)
We divide the predictor space (the set of possible values for \(X_1, \dots, X_p\)) into \(M\) distinct and non-overlapping regions: \(R_1, R_2, \dots, R_M\)
For every observation that falls into region \(R_m\), we make the same prediction, which is simply the mean of the response values for the training observations in \(R_m\)
But, how do we choose the regions?
In theory the regions could be any shape; they just cannot overlap.
In practice we divide the predictor space into high-dimensional rectangles, which we will simply call boxes.
The goal would be to choose the boxes \(R_1, \dots, R_M\) to minimize:
\[ RSS = \sum_{m=1}^M\sum_{x_i\in R_m}\left(y_i - \widehat{y}_{R_m}\right)^2 \]
Linear regression assumes the form:
\[ f(\underline{X}) = \beta_0 + \sum_{j=1}^p X_j \beta_j \]
Whereas regression trees assume the form:
\[ f(\underline{X}) = \sum_{m=1}^M c_m\cdot I(\underline{X}\in R_m) \]
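Given the regions, the RSS-minimizing constants are just the region means of the training responses:
\[ \widehat{c}_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i, \qquad N_m = \#\{i : x_i \in R_m\} \]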
It is computationally infeasible to consider every possible partition of the predictor space into regions.
We instead take the top-down approach described earlier, a process called Recursive Binary Splitting:
For each predictor \(X_j\), find the cutpoint \(s_j\) such that splitting into the regions \(X_j < s_j\) and \(X_j \geq s_j\) gives the greatest possible reduction in the RSS for that predictor.
Split on the predictor \(X_j\) and cutpoint \(s_j\) that give the greatest reduction in RSS across all predictors and their cutpoints.
Repeat within each resulting region, i.e. within \(X_j < s_j\) and \(X_j \geq s_j\): for each predictor, find the best cutpoint, then split on the predictor and cutpoint that reduce the RSS the most.
Stop when all \(Y\) observations in a region share the same value, when each region contains only a small number of observations, or when the change in RSS falls below some stopping criterion.
We choose a single variable to cut on without looking at any future cuts. This aspect of the process is referred to as greedy.
Greedy means we try to grab as much reduction in the RSS as possible at a single cut within the tree instead of trying to reduce the RSS across all possible trees.
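As a sketch of what a single greedy split looks like (a hypothetical best_split helper written here for illustration, not the tree package's internals), we scan every predictor and candidate cutpoint and keep the split with the lowest RSS:
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, cut = NA)
  for (j in seq_along(X)) {              # loop over predictor columns
    for (s in sort(unique(X[[j]]))) {    # candidate cutpoints
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = names(X)[j], cut = s)
    }
  }
  best
}
# e.g. best_split(calif[, c("Longitude", "Latitude")], log(calif$MedianHouseValue))
# (brute force, so this can be slow on all 20,640 rows)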
This greedy approach will lead to trees with too many leaves that overfit the training data.
One approach to this issue of greediness is to take an initial large (many branches) tree \(T_0\) and remove or prune the branches of \(T_0\) until we end up with a smaller subtree \(T\) that minimizes a penalized RSS.
\[ RSS(\alpha) = \sum_{m=1}^M\sum_{x_i\in R_m}\left(y_i - \widehat{y}_{R_m}\right)^2 + \alpha|T| \]
where \(|T|\) is the number of leaves (terminal nodes) of the subtree \(T\), so here \(|T| = M\).
What happens if \(\alpha = 0\)? (We just get the full tree \(T_0\) back, since extra leaves never hurt the training RSS.)
As we increase \(\alpha\) we penalize a tree more and more for the number of leaves that it has. This reduces the number of branches or prunes the tree in a nested and predictable manner.
This process is referred to as cost complexity pruning.
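As a quick illustration (the k argument of prune.tree plays the role of \(\alpha\); the values below are arbitrary), we can watch the tree from treefit4 shrink as the penalty grows:
# k is the cost-complexity parameter alpha; larger k prunes harder
for (a in c(0, 50, 200)) {
  pr <- prune.tree(treefit4, k = a)
  cat("alpha =", a, "-> leaves:", sum(pr$frame$var == "<leaf>"), "\n")
}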
Use recursive binary splitting to grow a large tree on the training data. Stop when each terminal node has fewer than some minimum number of observations.
Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).
Use \(K\)-fold cross-validation to choose \(\alpha\): divide the training observations into \(K\) folds, and for each \(k = 1, \dots, K\):
Repeat steps 1 and 2 on all but the \(k\)th fold of the training data.
Evaluate the mean squared prediction error on the data in the left-out \(k\)th fold, as a function of \(\alpha\).
Average the results for each value of \(\alpha\) and pick \(\alpha\) to minimize the average error.
In practice, this is automatically done by the function cv.tree from library(tree).
calitree <- tree(log(MedianHouseValue) ~ ., data = calif)
summary(calitree)
##
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ ., data = calif)
## Variables actually used in tree construction:
## [1] "MedianIncome" "Latitude" "Longitude" "MedianHouseAge"
## Number of terminal nodes: 15
## Residual mean deviance: 0.1321 = 2724 / 20620
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.86000 -0.22650 -0.01475 0.00000 0.20740 2.03900
plot(calitree)
text(calitree, cex = 0.75)
library(ggplot2)  # needed for the plot below
cv.cali <- cv.tree(calitree)
# cv.tree reports the cross-validated deviance for each tree size
cv.df <- data.frame(size = cv.cali$size, dev = cv.cali$dev)
ggplot(cv.df, aes(x = size, y = dev)) + geom_line() + geom_point()
In the case of this data, the most complex tree was chosen.
We can force pruning.
prune.cali <- prune.tree(calitree, best = 10)  # best = number of leaves to keep
plot(prune.cali)
text(prune.cali, cex = 0.7)
summary(prune.cali)
##
## Regression tree:
## snip.tree(tree = calitree, nodes = c(18L, 7L, 42L, 6L, 8L))
## Variables actually used in tree construction:
## [1] "MedianIncome" "Latitude" "Longitude"
## Number of terminal nodes: 10
## Residual mean deviance: 0.1509 = 3112 / 20630
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.64100 -0.25140 -0.01448 0.00000 0.25220 2.03900
prune.cali
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20640 6685.00 12.08
## 2) MedianIncome < 3.5471 10381 2662.00 11.77
## 4) MedianIncome < 2.51025 4842 1194.00 11.58
## 8) Latitude < 34.465 2322 450.20 11.78 *
## 9) Latitude > 34.465 2520 557.80 11.39
## 18) Longitude < -120.275 1792 386.00 11.51 *
## 19) Longitude > -120.275 728 77.14 11.08 *
## 5) MedianIncome > 2.51025 5539 1120.00 11.94
## 10) Latitude < 37.925 4435 901.70 12.01
## 20) Longitude < -122.235 351 47.73 12.48 *
## 21) Longitude > -122.235 4084 770.70 11.97
## 42) Latitude < 34.455 2814 410.80 12.06 *
## 43) Latitude > 34.455 1270 284.70 11.77
## 86) Longitude < -120.385 783 112.10 12.01 *
## 87) Longitude > -120.385 487 50.04 11.37 *
## 11) Latitude > 37.925 1104 123.70 11.68 *
## 3) MedianIncome > 3.5471 10259 1975.00 12.40
## 6) MedianIncome < 5.5892 7265 1156.00 12.26 *
## 7) MedianIncome > 5.5892 2994 298.70 12.75 *
Trees “easily” adapt to classification problems. The only real difference is that the response is now categorical instead of continuous.
data("iris")
iris.tree <- tree(Species ~ Petal.Length + Petal.Width, data = iris)
plot(iris.tree)
text(iris.tree, cex = 0.75)
plot(iris$Petal.Length, iris$Petal.Width, pch = 19, col = as.numeric(iris$Species))
partition.tree(iris.tree, label = "Species", add = TRUE)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
There are many choices for a metric. Let \(\widehat{p}_{mk}\) be the proportion of training observations in the \(m^{th}\) region that are from the \(k^{th}\) class.
| Measure | Formula |
|---|---|
| classification error rate | \(E = 1 - \max_k (\widehat{p}_{mk})\) |
| Gini index | \(G = \sum_k \widehat{p}_{mk}(1-\widehat{p}_{mk})\) |
| cross-entropy | \(D = -\sum_k \widehat{p}_{mk}\log(\widehat{p}_{mk})\) |
Both the Gini index and cross-entropy can be thought of as measuring the purity of a node: each is small when all the \(\widehat{p}_{mk}\) are near zero or one, i.e. when the node contains mostly one class. They are generally preferred over the classification error rate for growing the tree.
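A small helper makes the comparison concrete (p is a hypothetical vector of class proportions within a single node):
impurity <- function(p) {
  p <- p[p > 0]  # drop empty classes so log(0) never appears
  c(error   = 1 - max(p),
    gini    = sum(p * (1 - p)),
    entropy = -sum(p * log(p)))
}
impurity(c(0.5, 0.25, 0.25))   # a mixed node
impurity(c(0.9, 0.05, 0.05))   # a purer node: all three measures shrink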
We build a classifier by growing a tree that minimizes \(E\), \(G\) or \(D\).
The tree function defaults to a deviance criterion, which is similar to cross-entropy, but it can also split on the Gini index:
iris.tree.ent <- tree(Species ~ Petal.Length + Petal.Width, data = iris, split = "deviance")
iris.tree.gini <- tree(Species ~ Petal.Length + Petal.Width, data = iris, split = "gini")
plot(iris.tree.ent)
text(iris.tree.ent, cex = 0.75)
plot(iris.tree.gini)
text(iris.tree.gini, cex = 0.75)
plot(iris$Petal.Length, iris$Petal.Width, pch = 19, col = as.numeric(iris$Species))
partition.tree(iris.tree.ent, label = "Species", add = TRUE)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
plot(iris$Petal.Length, iris$Petal.Width, pch = 19, col = as.numeric(iris$Species))
partition.tree(iris.tree.gini, label = "Species", add = TRUE)
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
This method, like regression trees, will tend to overfit the data.
Our solution is the same as before: weakest-link pruning, now penalizing the number of misclassifications:
\[ \sum_{m=1}^M\sum_{x_i\in R_m} I\left(y_i \neq \widehat{y}_{R_m}\right) + \alpha|T| \]
Cross validation is used to choose the best \(\alpha\).
set.seed(121)
big.tree <- tree(Species ~ ., data = iris)
tree.cv <- cv.tree(big.tree, K = 5)          # 5-fold cross-validation
best.k <- tree.cv$k[which.min(tree.cv$dev)]  # alpha with the lowest CV deviance
pruned <- prune.tree(big.tree, k = best.k)
par(mfrow=c(1,2),mar=c(0,0,0,0))
plot(big.tree)
text(big.tree, cex=.7)
plot(pruned)
text(pruned, cex=.7)
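To use the pruned tree on new observations, predict with type = "class" returns the predicted species (the rows below are just a few training cases reused for illustration):
# Predicted classes for a few rows of iris (illustration only)
predict(pruned, newdata = iris[c(1, 60, 120), ], type = "class")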